This is an intriguing paper that raises important
questions, and I feel privileged to have been invited to
discuss it. The paper deals with a very basic problem
of sample surveys: how to weight the survey data in 
order to estimate finite population quantities of in- 
terest like means, differences of means or regression 
coefficients. 

The paper focuses for the most part on the common
estimator of a population mean, $\bar{y}_w = \sum_{i=1}^{n} w_i y_i / \sum_{i=1}^{n} w_i$,
and discusses different approaches to constructing the
weights by use of linear regression models.
These models vary in terms of the number and
nature of the regressors in the model and in the as- 
sumptions regarding the regression coefficients, 
whether fixed or random with prespecified distri- 
butions. The idea behind regression weighting is to 
include in the regression model all the variables and 
interactions that are related to the outcome values 
and affect the sample selection and the response 
probabilities, such that the sampling and response 
mechanisms are ignorable in the sense that the model 
fitted to the observed data is the same as the pop- 
ulation model before sampling. Assuming that all 
the important regressors affecting the sample selec- 
tion and response are discrete, the set of all pos- 
sible combinations of categories of these variables 
defines poststratification cells, which in turn define 
the dummy independent variables in the regression 
model. The target population parameter of interest
can then be written as $\theta = \sum_{j=1}^{J} N_j \theta_j / \sum_{j=1}^{J} N_j$,
where $\theta_j$ is the parameter for cell $j$ (say the true cell
mean, $\bar{Y}_j$), $N_j$ is the cell size and $J$ is the number of
cells. The regression estimator has the general form
$\hat{\theta}^{PS} = \sum_{j=1}^{J} N_j \hat{\theta}_j / \sum_{j=1}^{J} N_j$. For example, the estimator
of the population mean is $\hat{\bar{Y}}^{PS} = \sum_{j=1}^{J} N_j \bar{y}_j / \sum_{j=1}^{J} N_j$,
where $\bar{y}_j$ is the sample mean in cell $j$.
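
As a minimal sketch of the poststratification estimator of the population mean described above, the following Python fragment weights each cell's sample mean by its known population size. The function name `poststratified_mean` and the toy data are hypothetical and serve only as an illustration; the sketch assumes that the cell sizes are known and that every cell contains at least one sampled unit.

```python
from collections import defaultdict


def poststratified_mean(cells, values, cell_sizes):
    """Return sum_j N_j * ybar_j / sum_j N_j.

    cells      -- cell label of each sampled unit
    values     -- outcome y of each sampled unit
    cell_sizes -- dict mapping cell label -> known population size N_j
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cell, y in zip(cells, values):
        sums[cell] += y
        counts[cell] += 1
    empty = [cell for cell in cell_sizes if counts[cell] == 0]
    if empty:
        # With fixed coefficients, a cell without sampled units has no sample mean.
        raise ValueError(f"no sampled units in cells: {empty}")
    total_size = sum(cell_sizes.values())
    return sum(n_j * sums[c] / counts[c] for c, n_j in cell_sizes.items()) / total_size


# Hypothetical toy data: cell "a" is oversampled relative to cell "b".
cells = ["a", "a", "a", "b"]
values = [1.0, 2.0, 3.0, 10.0]
cell_sizes = {"a": 100, "b": 300}
estimate = poststratified_mean(cells, values, cell_sizes)
```

The same value is obtained by giving each sampled unit the weight N_j/n_j of its cell and computing the weighted mean, which is one way to see the estimator as a weighted one. The explicit error for empty cells mirrors the difficulty with zero sample sizes raised in the remarks below.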

The discussion that follows is divided into two 
parts. In the first part I comment on the proposed 
weighting approach and some of the statements made 
in the article. In the second part I consider another 
approach for analyzing survey data that are subject 
to unequal sample selection probabilities and non- 
response, and compare it to the approach taken in 
this paper. 

1. REMARKS ON THE PAPER 

• The first obvious remark, which is already made
in the Abstract, is that the number of poststratification
cells can be extremely large, with inevitably
very small or no samples in many of the
cells. Having small or no samples in some or even 
most of the cells is theoretically not a problem un- 
der the model with random regression coefficients 
considered later, but it is not clear what should be 
done in such a case under the standard regression 
model with fixed coefficients. Note in particular the 
problems arising if the zero sample sizes are due to 
nonresponse. Deleting these cells from the regression 
model may violate the sample ignorability assump- 
tion. It is stated in Section 3.1 that it is not required 
to include in the model all the cells, but this raises 
the question of which cells to exclude and based on 
what criteria. It may imply also including different 
cells (interactions) for different outcome variables of 
interest. 

• It is assumed that the cell sizes are known. This 
could be a strong assumption in a large-scale survey 
with many small cells. Also, it is often the case that 
the cell sizes are known to the person drawing the 
sample, but not necessarily to the person analyzing 
the data, who has limited access to the original files 
due to confidentiality restrictions or other reasons. 
The argument that the cell sizes can be estimated 
using iterative proportional fitting or other related
procedures is well taken, but this raises questions 
of the effect of using estimated sizes on the perfor- 
mance of the estimators and how to estimate the 
variances, accounting for this source of variability. 
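
For readers unfamiliar with iterative proportional fitting, the following is a minimal sketch for a two-way table, assuming that only the row and column population margins are known and that the sample counts serve as the starting table. The function name `rake`, the stopping rule and all numbers are hypothetical. The sketch produces estimated cell sizes only; it does not address the variance question raised above.

```python
def rake(seed, row_targets, col_targets, max_iter=100, tol=1e-10):
    """Scale a two-way table of positive counts until its margins match the targets."""
    table = [row[:] for row in seed]  # work on a copy of the seed table
    for _ in range(max_iter):
        # Step 1: rescale each row to its target total.
        for r, target in enumerate(row_targets):
            row_sum = sum(table[r])
            if row_sum > 0:
                table[r] = [v * target / row_sum for v in table[r]]
        # Step 2: rescale each column to its target total.
        for c, target in enumerate(col_targets):
            col_sum = sum(row[c] for row in table)
            if col_sum > 0:
                for row in table:
                    row[c] *= target / col_sum
        # Columns now match; stop once the rows match as well.
        row_error = max(abs(sum(table[r]) - t) for r, t in enumerate(row_targets))
        if row_error < tol:
            break
    return table


# Hypothetical sample counts and known population margins (both sum to 1000).
sample_counts = [[10.0, 20.0], [30.0, 40.0]]
estimated_sizes = rake(sample_counts, row_targets=[400.0, 600.0], col_targets=[500.0, 500.0])
```

The fitted table preserves the interaction structure (odds ratios) of the sample counts, so the estimated cell sizes are themselves functions of the sample, which is precisely the extra source of variability mentioned above.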

• A third problem and in a way the most difficult 
one to handle is the implicit assumption that the 
analyst knows all the variables affecting the sample 
selection and nonresponse. Here again a distinction 
should be made between the person drawing the 
sample who should at least know all the variables 
that affect the sample selection, and the person an- 
alyzing the data who may not even have that infor- 
mation. When it comes to nonresponse, both per- 
sons can only hypothesize which variables explain 
the nonresponse. I should add to this that the paper 
implicitly assumes that the missing data are missing 
at random (MAR), which of course may not be the 
case in practice. The alternative approach described 
later overcomes in principle these problems but it 
requires modeling the sample inclusion probabilities 
as a function of the observed data. 

• It is mentioned that computing the variances 
of weighted estimators may not be trivial, because 
the weights are generally random variables that de- 
pend on the data. I can see that weighting cells that 
account for nonresponse are "data driven," but for 
given cells, the computation of the variances should 
not be complicated, even though the response proba- 
bilities are only estimated. Thus, a distinction should 
be made between conditional and unconditional vari- 
ances. A more crucial distinction, however, is be- 
tween variances and mean square errors, because as 
already implied by my previous comment, the main 
issue is whether the cells are defined correctly and 
the nonresponse is indeed MAR. 

• The paper proposes a two-step procedure for 
estimating the regression of y on z. The first step 
consists of regressing y on z and X and interactions 
between them, where X represents the variables affecting
the sample inclusion probabilities; the second
step consists of regressing X on z in order to
obtain the regression of y on z alone [(4) in the paper].
I have no problem with this approach, but as
the paper repeatedly emphasizes, the regression (averaging)
in the second step must be adjusted for
the population distribution of X. If this distribution
is unknown, which may well be the case in practice,
one is bound to use some sort of weighting through
the back door. Thus, an alternative "weighted regression"
procedure favored by survey analysts is to
regress y on z alone, but use weighted regression
with the weights defined by the inverse of the sample
inclusion probabilities. Consider the example in
the paper of regressing log earnings against ethnicity
(white/nonwhite) in order to estimate the difference
$E(y \mid \text{white} = 1) - E(y \mid \text{white} = 0)$. Suppose that
the survey oversamples males. It is argued that the
model should include in this case as additional regressors
"gender" and the interaction between white
and male, and then obtain the regression of log earnings
on ethnicity by applying the second step described
above. This model accounts for possible differences
between the effects of the two genders on
log earnings for a given ethnicity, and is thus the
"correct model," irrespective of the sample inclusion
probabilities. Application of the weighted regression
approach to the example consists in this
case of regressing y against z (defined by two dummy
variables representing "white" and "nonwhite") and
weighting each sample value by the inverse of the
sample inclusion probability. Denoting the sample of
"white" by $S_1$ and the sample of "nonwhite" by $S_2$,
the resulting estimator is $\left(\sum_{i \in S_1} w_i y_{1i} / \sum_{i \in S_1} w_i\right) - \left(\sum_{i \in S_2} w_i y_{2i} / \sum_{i \in S_2} w_i\right)$. Clearly, if the model with
the gender variable and the interaction term is the 
correct model, the model without them is the wrong 
model and weighting the sample observations does 
not correct the model. However, as long as the weights 
are estimated appropriately (accounting for the sam- 
ple selection and response probabilities), the use of 
this procedure yields a consistent estimator for the 
difference of interest. I believe that many analysts 
would use weighted regression even when fitting the 
"correct model," so as to protect against other pos- 
sible model misspecifications. 
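
The consistency claim for the weighted difference of means can be illustrated with a small deterministic sketch. All numbers below are invented: four gender-by-ethnicity cells with assumed population sizes and cell means of log earnings, males sampled at twice the rate of females, and, purely for transparency, every sampled unit taking the value of its cell mean. Under these assumptions the weighted difference reproduces the population difference exactly, whereas the unweighted difference does not.

```python
def weighted_mean(ys, ws):
    """Weighted mean sum(w*y) / sum(w)."""
    return sum(w * y for y, w in zip(ys, ws)) / sum(ws)


# (white, male) -> (population size, cell mean of log earnings); hypothetical values.
population = {(1, 1): (400, 3.0), (1, 0): (200, 2.0),
              (0, 1): (100, 2.5), (0, 0): (300, 1.5)}
rate = {1: 0.10, 0: 0.05}  # assumed sampling rates: males are oversampled

# Build the sample as (white, y, weight) with weight = 1 / inclusion probability.
sample = []
for (white, male), (size, mean) in population.items():
    n_cell = round(size * rate[male])
    sample += [(white, mean, 1.0 / rate[male])] * n_cell


def difference(sample, use_weights):
    """Mean of y among white minus mean of y among nonwhite."""
    means = {}
    for group in (1, 0):
        ys = [y for white, y, _ in sample if white == group]
        ws = [w if use_weights else 1.0 for white, _, w in sample if white == group]
        means[group] = weighted_mean(ys, ws)
    return means[1] - means[0]


def population_mean(group):
    cells = [(size, mean) for (white, _), (size, mean) in population.items() if white == group]
    return sum(size * mean for size, mean in cells) / sum(size for size, _ in cells)


true_difference = population_mean(1) - population_mean(0)
weighted_difference = difference(sample, use_weights=True)
unweighted_difference = difference(sample, use_weights=False)
```

In this toy setting the gender composition differs between the two ethnic groups, which is why ignoring the weights distorts the comparison even though the regression on ethnicity alone is the "wrong model."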

• It is mentioned in Section 3.1 that the full poststratification
estimator of the population mean,
$\hat{\theta}^{PS} = \sum_{j=1}^{J} N_j \bar{y}_j / \sum_{j=1}^{J} N_j$, can be viewed as a classical
regression estimator by including indicators for
all the poststratification cells. How are the sizes $N_j$
captured by the regression model? Is it not a weighted
regression estimator?

• It is stated that weighted regression is not flexible
and that it is not clear how to apply the weights.
I do not think that this is correct. The use of pseudo 
likelihood methods, for example (see the discussion 
and references in Pfeffermann, 1993), is well estab- 
lished and very common. See also below for a model- 
based justification for weighted regression. The use 
or nonuse of the weights has nothing to do with the 
use of models for small area estimation problems, as
seems to be suggested in Section 4. 

• As pointed out in the paper, the use of hier- 
archical models implies different sets of weights for 
different outcome variables. Statistical bureaus do 
not like this and usually insist on a single weight 
for a given sample unit, even at the risk of loss of 
efficiency. To highlight this problem a bit further, 
suppose that one is interested in three variables, $y_1$,
$y_2$ and $y_3 = g(y_1, y_2)$ for some function $g$. Say $y_1$
is total earnings in a given month, $y_2$ is the number
of hours worked and $y_3$ is the mean earnings per
hour. Fitting a hierarchical model to each of the 
three variables separately would imply three differ- 
ent sets of weights, which some would argue does not 
make sense in this case, beyond not being practical. 

• The use of the normal model with independent 
random effects for the J cell means does not seem 
appropriate if the cells are defined by interactions 
of the regressors that account for the sample selec- 
tion and nonresponse. Some of these cells are "close" 
to each other, say the cells defined by given cate- 
gories of gender, ethnicity and level of education, 
and adjacent categories of age, and other cells are
far apart. Thus, it is more appropriate in this case
to fit a model with spatial correlations between the 
random effects that reflect the distance between the 
corresponding cells. The computation of the weights 
under the model is neat. Note that with many cells 
and very small sample sizes within the cells, the cell 
predictor $\hat{\theta}_j$ in (10) will often be close to the synthetic
estimator $\hat{\mu}$ in (11), which is then also approximately
the estimator of the population mean. As a
result, the weight will be approximately constant. 

2. ALTERNATIVE APPROACHES 

As discussed above, a major problem with the ap- 
plication of the approach proposed in this article is 
that it requires knowledge of all the important vari- 
ables affecting the sample selection or nonresponse 
(the X variables). As argued by Alexander (1987), 
"no model will include all the relevant variables and 
few analysts will wish to include in the model all 
the geographic and operational variables which de- 
termine sampling rates. The theoretical and empiri- 
cal tasks of fitting and validating such models seem 
formidable for many surveys." 

One way to deal with this problem, considered 
by Rubin (1985), is to use the vector of sample 
inclusion probabilities as a surrogate for the variables
in X, but as further discussed in Smith (1988),
this approach is not always valid and in the case
of nonresponse, the true inclusion probabilities are 
unknown and need to be estimated. Skinner (1994) 
models the outcomes in the sample as a function of 
the model covariates and the sampling weights, and 
the sampling weights in the sample as a function of 
the model covariates, and shows how to obtain the 
model for the outcomes in the population from these 
two models. 

In what follows I outline briefly the basic ideas of 
another approach for estimating population models 
and predicting finite population quantities. This ap- 
proach models the sample data and bases the infer- 
ence on the sample model. See the references below 
for more details with examples and applications. I 
consider for convenience single stage sampling and 
assume that the sample selection and response are 
independent between the sampling units. As before, 
denote by y the outcome variable and suppose first 
that one is interested in identifying and estimating 
the population model $f_p(y \mid z)$, where $z$ is a set of covariates.
Following Pfeffermann, Krieger and Rinott
(1998), the sample model is defined as

$$
f_s(y_i \mid z_i) \overset{\text{def}}{=} f(y_i \mid z_i, i \in s)
= \frac{\Pr(i \in s \mid y_i, z_i)\, f_p(y_i \mid z_i)}{\Pr(i \in s \mid z_i)}
= \frac{E_p(\pi_i \mid y_i, z_i)\, f_p(y_i \mid z_i)}{E_p(\pi_i \mid z_i)}, \tag{1}
$$

where $\pi_i = \Pr(i \in s)$ is the sample inclusion probability
(the probability of being selected and responding).

Remark 1. By (1), the sample model is the
same as the population model if $\Pr(i \in s \mid y_i, z_i) = \Pr(i \in s \mid z_i)$
for all $y_i$, in which case the sampling process
is ignorable.

Remark 2. $\Pr(i \in s \mid y_i, z_i)$ is generally not the
same as $\pi_i$, which may depend on the variables in
X and possibly also on the $y$-values in the case of
not missing at random (NMAR) nonresponse. However, the use of the sample
model only requires modeling $\Pr(i \in s \mid y_i, z_i)$ or
$E_p(\pi_i \mid y_i, z_i)$, thus circumventing the need to know
the variables X and incorporate them in the model.
Note that the sample model resulting from model- 
ing the sample inclusion probabilities can be tested 
using standard goodness-of-fit test statistics, since 
the sample model refers to the sample data. 

The following relationship between the population
model and the sample model is established in Pfeffermann
and Sverchkov (1999), where $w_i = 1/\pi_i$ and
$E_s(\cdot)$ is the expectation under the sample model:

$$
f_p(y_i \mid z_i) = \frac{E_s(w_i \mid y_i, z_i)\, f_s(y_i \mid z_i)}{E_s(w_i \mid z_i)}. \tag{2}
$$

Thus, one can identify and estimate the population 
model by fitting the sample model to the sample 
data and estimating the expectations $E_s(w_i \mid y_i, z_i)$,
again using the sample data. Clearly, both the sample
model and the expectations $E_s(w_i \mid y_i, z_i)$ depend
in general on unknown parameters. Pfeffermann and 
Sverchkov (2003) discuss alternative approaches of 
estimating these parameters, with examples. Note 
in this respect that if the outcomes are independent 
under the population model, they are also "asymp- 
totically independent" under the sample model when 
increasing the population size but holding the sam- 
ple size fixed. See Pfeffermann, Krieger and Rinott 
(1998) for details. 
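
Relationships (1) and (2) can be checked numerically in a deliberately simple case. The sketch below assumes a binary outcome, no covariates and an inclusion probability that is a known function of the outcome alone, so that the conditional expectation of the weight given the outcome is just the inverse of that probability. All probabilities are invented for illustration.

```python
# Assumed population model for a binary outcome y (no covariates).
f_p = {0: 0.7, 1: 0.3}
# Assumed informative inclusion probabilities: units with y = 1 are oversampled.
pi = {0: 0.02, 1: 0.10}

# Relationship (1): the sample model is the population model tilted by pi(y).
expected_pi = sum(pi[y] * f_p[y] for y in f_p)
f_s = {y: pi[y] * f_p[y] / expected_pi for y in f_p}

# Sample expectations of the weight w = 1 / pi.
e_s_w_given_y = {y: 1.0 / pi[y] for y in f_p}
e_s_w = sum(e_s_w_given_y[y] * f_s[y] for y in f_p)

# Relationship (2): recover the population model from sample quantities only.
recovered_f_p = {y: e_s_w_given_y[y] * f_s[y] / e_s_w for y in f_p}
```

Here the sample model puts noticeably more mass on y = 1 than the population model does, and reweighting by the sample expectation of the weights undoes that distortion.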

Remark 3. For likelihood- or Bayesian-based 
inference, one can employ the "full likelihood" of the 
sample data and the sample membership indicators, 

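$$
f(s, y_s \mid z_s, z_{\bar{s}}) = \prod_{i \in s} \Pr(i \in s \mid y_i, z_i)\, f_p(y_i \mid z_i)
\times \prod_{j \notin s} [1 - \Pr(j \in s \mid z_j)], \tag{3}
$$

where $\Pr(j \in s \mid z_j) = \int \Pr(j \in s \mid y_j, z_j)\, f_p(y_j \mid z_j)\, dy_j$ is
the propensity score for unit $j$; see, for example, Gelman
et al. (2004) and Little (2004). The use of (3)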
has the advantage of employing the information on 
the sample selection probabilities for units outside 
the sample, but it requires knowledge of the covari- 
ates for every unit in the population, unlike the use 
of the sample likelihood that is based on the sample 
model. Modeling the joint distribution of the covari- 
ates and integrating them out of the likelihood is 
often too complicated. 

Remark 4. I mentioned before that the use of 
weighted regression can be justified theoretically. 
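Suppose that the population model is $y_i = z_i'\beta + \varepsilon_i$;
$E_p(\varepsilon_i \mid z_i) = 0$, $E_p(\varepsilon_i^2 \mid z_i) = \sigma^2$. By (2),

$$
\beta = \arg\min_{\tilde{\beta}} E_p (y_i - z_i'\tilde{\beta})^2
= \arg\min_{\tilde{\beta}} E_s\!\left[\frac{w_i (y_i - z_i'\tilde{\beta})^2}{E_s(w_i)}\right]
= \arg\min_{\tilde{\beta}} E_s[w_i (y_i - z_i'\tilde{\beta})^2], \tag{4}
$$

noting that $E_s(w_i) = [N/E(n)] = \text{const}$. Replacing
the sample expectation in the right-hand side of (4)
by the sample mean yields the weighted regression
estimator $b_w = [\sum_{i \in s} w_i z_i z_i']^{-1} \sum_{i \in s} w_i z_i y_i$ as the
optimal (least squares) solution.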

Remark 5. By conditioning on $z_i$ and hence
minimizing $\sum_{i=1}^{n} E_s\!\left[\frac{w_i (y_i - z_i'\tilde{\beta})^2}{E_s(w_i \mid z_i)} \,\Big|\, z_i\right]$, one obtains the estimator
$b_q = [\sum_{i \in s} q_i z_i z_i']^{-1} \sum_{i \in s} q_i z_i y_i$, where $q_i = w_i / E_s(w_i \mid z_i)$.
The weights $\{q_i\}$ account for the net sampling
effects on the conditional target distribution
$f_p(y_i \mid z_i)$, and the estimator $b_q$ is therefore less variable
than $b_w$. See Pfeffermann and Sverchkov (1999)
for further discussion and empirical comparisons be- 
tween the two estimators. 
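
The two weighted regression estimators of Remarks 4 and 5 can be sketched as follows for the simplest case of an intercept and a single covariate. The helper names `wls_line` and `q_weights` and the data are hypothetical. The sketch assumes a discrete covariate, so that the sample expectation of the weight given the covariate can be estimated by the mean weight within each covariate value; with continuous covariates that expectation would have to be modeled.

```python
from collections import defaultdict


def wls_line(xs, ys, ws):
    """Weighted least squares for z_i = (1, x_i) via the 2x2 normal equations."""
    sw = sum(ws)
    sx = sum(w * x for w, x in zip(ws, xs))
    sy = sum(w * y for w, y in zip(ws, ys))
    sxx = sum(w * x * x for w, x in zip(ws, xs))
    sxy = sum(w * x * y for w, x, y in zip(ws, xs, ys))
    det = sw * sxx - sx * sx  # nonzero when x varies and the weights are positive
    slope = (sw * sxy - sx * sy) / det
    intercept = (sy - slope * sx) / sw
    return intercept, slope


def q_weights(xs, ws):
    """q_i = w_i / (mean of w among sampled units sharing the same x value)."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for x, w in zip(xs, ws):
        totals[x] += w
        counts[x] += 1
    return [w * counts[x] / totals[x] for x, w in zip(xs, ws)]


# Hypothetical sample: y is exactly 1 + 2x, so both estimators must return (1, 2).
xs = [0, 0, 0, 1, 1, 1, 2, 2, 2]
ys = [1.0 + 2.0 * x for x in xs]
ws = [10.0, 20.0, 30.0, 5.0, 5.0, 20.0, 8.0, 12.0, 40.0]

b_w = wls_line(xs, ys, ws)                 # weights w_i
qs = q_weights(xs, ws)
b_q = wls_line(xs, ys, qs)                 # weights q_i = w_i / E_s(w_i | z_i)
```

With noise-free data the two estimators coincide; the difference in variability discussed in Remark 5 would only show up with random errors. By construction the q-weights average to one within each covariate value, so they retain only the variation of the weights that is not explained by the covariate.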

How can the sample model be used for estimating 
finite population totals or means? For this we need 
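to define the sample-complement model,

$$
f_c(y_i \mid z_i) \overset{\text{def}}{=} \frac{\Pr(i \notin s \mid y_i, z_i)\, f_p(y_i \mid z_i)}{\Pr(i \notin s \mid z_i)}
= \frac{E_s[(w_i - 1) \mid y_i, z_i]\, f_s(y_i \mid z_i)}{E_s[(w_i - 1) \mid z_i]}, \tag{5}
$$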

with the last equality shown in Sverchkov and Pfef- 
fermann (2004). Note that the sample-complement 
model is again a function of the sample model and 
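the expectation $E_s(w_i \mid z_i)$, and thus can be estimated
from the sample data. The optimal predictor of the
population total under a quadratic loss function is

$$
\hat{Y} = \sum_{i \in s} y_i + \sum_{j \notin s} E_c(y_j \mid z_j)
= \sum_{i \in s} y_i + \sum_{j \notin s} \frac{E_s[(w_j - 1) y_j \mid z_j]}{E_s[(w_j - 1) \mid z_j]}. \tag{6}
$$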

The last equality follows from (5), with the sample 
expectations in the numerator and the denomina- 
tor either being modeled based on the sample data 
or simply estimated by the corresponding sample 
means by application of the method of moments. 
As shown in Sverchkov and Pfeffermann (2004), fa- 
miliar estimators of finite population means such 
as the estimator $\bar{y}_w = \sum_{i=1}^{n} w_i y_i / \sum_{i=1}^{n} w_i$ studied in
the present paper are obtained as special cases of 
this theory by specifying appropriate population or 
sample models. Pfeffermann and Sverchkov (2007) 
use the sample and sample-complement models for
small area estimation under informative sampling of 
areas and within the areas. 
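
To illustrate how a familiar estimator arises as a special case, consider a sketch with no covariates in which the two sample expectations in the predictor of the total are replaced by sample means (method of moments). The function name `predict_total` and the data are hypothetical, and the sketch assumes that the weights sum to the known population size. Under that assumption the predicted total equals the weighted sum of the outcomes, so the implied mean is the weighted mean studied in the paper.

```python
def predict_total(ys, ws, population_size):
    """Observed total plus predicted total of the nonsampled units.

    The mean of a nonsampled unit is estimated by
    sum((w_i - 1) * y_i) / sum(w_i - 1), the method-of-moments version of
    E_s[(w - 1) y] / E_s[(w - 1)] without covariates.
    """
    n = len(ys)
    numerator = sum((w - 1.0) * y for y, w in zip(ys, ws))
    denominator = sum(w - 1.0 for w in ws)
    complement_mean = numerator / denominator
    return sum(ys) + (population_size - n) * complement_mean


# Hypothetical sample of four units whose weights sum to N = 400.
ys = [1.0, 2.0, 3.0, 10.0]
ws = [100.0 / 3.0, 100.0 / 3.0, 100.0 / 3.0, 300.0]
population_size = 400

predicted_total = predict_total(ys, ws, population_size)
predicted_mean = predicted_total / population_size
weighted_mean = sum(w * y for y, w in zip(ys, ws)) / sum(ws)
```

The equality holds because the weights minus one sum to the number of nonsampled units when the weights sum to the population size, so the second term reduces to the sum of (w_i - 1) y_i.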

To summarize, the alternative approach outlined 
above has the advantage of not requiring incorporat- 
ing in the model the variables affecting the sample 
selection and response, unless they are part of the 
covariates that define the target model of interest. It 
can be applied also in situations where the response 
process is NMAR. However, it requires modeling the 
expectation $E_s(w_i \mid y_i, z_i)$, which may not be easy in
the presence of nonresponse. On the other hand, as 
mentioned before, the resulting sample model can be 
tested using classical goodness-of-fit statistics, since 
the sample model refers to the sample data. 